Abstract



This paper describes a general, trainable architecture for object detection that has previously been applied
to face and people detection, and presents a new application to car detection in static images. Our technique is a
learning based approach that uses a set of labeled training data from which an implicit model of an object 
class - here, cars - is learned. Instead of pixel representations that may be noisy and therefore not provide 
a compact representation for learning, our training images are transformed from pixel space to that of 
Haar wavelets that respond to local, oriented, multiscale intensity differences. These feature vectors are 
then used to train a support vector machine classifier. The detection of cars in images is an important step 
in applications such as traffic monitoring, driver assistance systems, and surveillance, among others. We 
show several examples of car detection on out-of-sample images and show an ROC curve that highlights 
the performance of our system. 
1 Introduction

This paper describes a trainable system for object de- 
tection in static images with a particular application to 
car detection. This system has previously been applied 
to both face and people detection with success; we high- 
light the generality of the system with this new domain 
of car detection. The detection of cars in images is an 
important step in applications such as traffic monitor- 
ing, driver assistance systems, and surveillance, among 
others. Our approach is to use example based learning; 
we provide the system with a set of training data and it 
learns what a car looks like. Rather than using "stock" 
images of single cars, we consistently use images gath- 
ered from real world scenes. This system currently iden- 
tifies frontal and rear views of cars. 

While it is possible to construct simple models for 
identifying and tracking cars in constrained domains - 
for instance, if we know that our camera will always 
be mounted at a fixed location over a highway - these 
types of systems will have limited use in more general 
applications and conditions. We avoid any handcrafting 
and present a learning based approach to car detection 
that uses a set of labeled training data to derive an im- 
plicit model of cars. Since the pixel images may be noisy 
and therefore not provide a compact representation for 
learning, we use features that respond to local, oriented 
intensity differences in the images; specifically, we use 
a Haar wavelet representation. The car images we use 
for training are transformed from pixel space to wavelet 
space and are then used to train a support vector ma- 
chine classifier. Support vector machines are capable of 
finding optimal separating hyperplanes in high dimen- 
sional spaces with very few training examples. 

The previous work in car detection can be divided into 
approaches that find cars in static images and techniques 
that process video sequences; we first look at some static 
approaches. Bregler & Malik, 1996 [4] describe a system 
that uses mixtures of experts to identify different classes 
of cars. The inputs for classification are a large num- 
ber of second order Gaussian features projected onto a 
smaller dimensional space. This system assumes that the 
cars are already segmented and scaled; it is not a detec- 
tion system, but shares some inspiration (large number 
of intensity difference operators to represent classes of 
objects) with ours. Lipson, 1996 [7] describes a system 
that uses a deformable template for side view car de- 
tection. In this system, the wheels, mid-body region, 
and regions above the wheels are roughly detected based 
on photometric and geometric relations. The wheels are 
then more precisely localized using a Hausdorff match.
Processing is confined to high resolution images, possibly 
a restriction for more general detection tasks. This sys- 
tem has been applied to scene classification [8] and shares 
some conceptual similarity with that of Sinha [16, 17]. 
Rajagopalan et al., 1999 [15] have recently developed
a trainable car detection system that clusters the posi- 
tive data in a high dimensional space and, to classify an 
unknown pattern, computes and thresholds a distance 
measure based on the higher order statistics of the dis- 
tribution. This technique has a good deal in common 
with the face detection system of [19, 20]. 



Motion, or at least the use of multiple frames, provides
information that can be used to better identify cars; the
following systems use dynamical information in one way or
another. Beymer et al., 1997 [3] present a traffic monitoring
system that has a car detection module. This portion of the
system locates corner features in highway sequences and groups
the features belonging to single cars together by integrating
information over time. Since the system operates in a fairly
restricted domain, their detection requirements are not as
stringent as our own. Betke et al., 1997 [1] and Betke & Nguyen,
1998 [2] use corner features and edge maps combined with
template matching to detect cars in highway video scenes. This
system can afford to rely on motion since it is designed for a
fairly narrow domain, that of highway scene analysis from a
vehicle.

2 System Overview 

The core system we use is a general, trainable object 
detection system that has previously been described in 
[13, 11, 14, 12]. This paper provides further evidence 
that this system is indeed a general architecture by 
adding car detection to the existing applications of face 
and people detection. We note that each of these in- 
stances uses a single framework with no specialized mod- 
ifications to the code; only the training data is different. 

The car detection system uses a database of 516 
frontal and rear color images of cars, normalized to 
128 x 128 and aligned such that the front or rear bumper 
is 64 pixels across. For training, we use the mirror im- 
ages as well for a total of 1,032 positive patterns and 
5,166 negative patterns; a few examples from our train- 
ing database are shown in Figure 1. From the images, 
it should be easy to see that the pixel based representa- 
tions have a significant amount of variability that may 
lead to difficulties in learning; for instance, a dark body 
on a white background and a white body on a dark back- 
ground would have significantly different characteristics 
under a pixel representation. 
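
As a minimal sketch of the mirroring step described above, assuming an image is stored as a plain list of pixel rows (the helper names `mirror` and `augment_with_mirrors` are hypothetical and not part of the original system):

```python
def mirror(image):
    """Return the left-right mirror of an image stored as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def augment_with_mirrors(images):
    """Return the original images followed by their mirror images."""
    return list(images) + [mirror(img) for img in images]


# Tiny 2 x 3 stand-in for a 128 x 128 car image.
toy = [[1, 2, 3],
       [4, 5, 6]]
augmented = augment_with_mirrors([toy])

# The 516 database images plus their mirrors give the positive set size.
n_positive = 2 * 516
```

Mirroring doubles the positive set without collecting new data, which is how 516 images become 1,032 positive patterns.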

To avoid these difficulties and provide a compact rep- 
resentation, we use an overcomplete dictionary of Haar 
wavelets in which there is a large set of features that 
respond to local intensity differences at several orien- 
tations. We present an overview of this representation 
here; details can be found in [9] [18]. 

For a given pattern, the wavelet transform computes 
the responses of the wavelet filters over the image. Each 
of the three oriented wavelets - vertical, horizontal, and
diagonal - is computed at several different scales, allowing
the system to represent coarse scale features all the
way down to fine scale features. In our system for car 
detection, we use the scales 32 x 32 and 16 x 16. In the 
traditional wavelet transform, the wavelets do not over- 
lap; they are shifted by the size of the support of the 
wavelet in x and y. To achieve better spatial resolution 
and a richer set of features, our transform shifts by 1/4
of the size of the support of each wavelet, yielding an 
overcomplete dictionary of wavelet features. The result- 
ing high dimensional feature vectors are used as training 
data for our classification engine. 
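
The densely shifted transform can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the paper: the image is a single channel stored as a list of rows, responses are unnormalized differences of box sums, the sign conventions are arbitrary, and only supports that fit entirely inside the image are evaluated. All function names are hypothetical.

```python
def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Sum of the pixels in the box with top-left corner (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_responses(ii, x, y, size):
    """Vertical, horizontal, and diagonal responses for one square support."""
    half = size // 2
    tl = box_sum(ii, x, y, half, half)
    tr = box_sum(ii, x + half, y, half, half)
    bl = box_sum(ii, x, y + half, half, half)
    br = box_sum(ii, x + half, y + half, half, half)
    vertical = (tl + bl) - (tr + br)    # left half minus right half
    horizontal = (tl + tr) - (bl + br)  # top half minus bottom half
    diagonal = (tl + br) - (tr + bl)    # one diagonal pair minus the other
    return vertical, horizontal, diagonal


def dense_haar(img, size, shift_fraction=4):
    """Responses at every position, shifting by size / shift_fraction."""
    step = size // shift_fraction
    ii = integral_image(img)
    h, w = len(img), len(img[0])
    feats = []
    for y in range(0, h - size + 1, step):
        for x in range(0, w - size + 1, step):
            feats.extend(haar_responses(ii, x, y, size))
    return feats
```

Setting `shift_fraction=1` would give the non-overlapping shifts of the traditional transform; the value 4 corresponds to the quarter-support shift described above.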

Figure 1: Examples from the database of cars used for training. The images are color of size 128 x 128 pixels,
normalized so that the front or rear bumper is 64 pixels wide.

Figure 2: The Haar wavelet framework; (a) the Haar scaling function and wavelet, (b) the three types of 2-dimensional
non-standard Haar wavelets: vertical, horizontal, and diagonal, and (c) the shift in the standard transform as
compared to our quadruply dense shift resulting in an overcomplete dictionary of wavelets.

Figure 3: Ensemble average values of the wavelet features of cars coded using gray level. Coefficients whose values
are above the average are darker, those below the average are lighter; (a)-(c) are the vertical, horizontal, and diagonal
wavelets at scale 32 x 32, (d)-(f) are the vertical, horizontal, and diagonal wavelets at scale 16 x 16.

There is certain a priori knowledge embedded in our
choice of the wavelets. First, we use the absolute values
of the magnitudes of the wavelets; this tells the system
that a dark body on a light background and a light
body on a dark background have the same information 
content. Second, we compute the wavelet transform for 
a given pattern in each of the three color channels and 
then, for a wavelet of a specific location and orientation, 
we use the one that is largest in magnitude. This allows 
the system to use the most visually significant features. 
We note that for car detection we are using exactly the 
same set of prior assumptions as for our people detection 
system; this is contrasted with our face detection system 
which operates over grey level images. The characteris- 
tics of this wavelet representation are depicted in Figure 
2. 

The two scales of wavelets we use for detection are 
16 x 16 and 32 x 32. We collapse the three color channel
features into a single channel by taking, for each location,
orientation, and scale, the maximum wavelet response over
the three channels. This gives us a total of 3,030
wavelet features that are used to train the SVM. 
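
One accounting that reproduces the stated total of 3,030 is sketched below. It assumes that only supports lying entirely inside the 128 x 128 window are counted and that each support is shifted by a quarter of its size; the function names are hypothetical, and the channel collapse is shown on short toy lists rather than real wavelet responses.

```python
def positions_per_axis(window, support, shift_fraction=4):
    """Number of support positions along one axis of the window."""
    step = support // shift_fraction
    return (window - support) // step + 1


def feature_count(window=128, supports=(32, 16), orientations=3):
    """Total features over all scales, orientations, and positions."""
    return sum(orientations * positions_per_axis(window, s) ** 2
               for s in supports)


def collapse_channels(red, green, blue):
    """Keep, per feature, the largest absolute response of the three channels."""
    return [max(abs(r), abs(g), abs(b)) for r, g, b in zip(red, green, blue)]


# 32 x 32 supports: 13 positions per axis, 13 * 13 * 3 = 507 features.
# 16 x 16 supports: 29 positions per axis, 29 * 29 * 3 = 2,523 features.
total = feature_count()
```

Under these assumptions the two scales contribute 507 and 2,523 features, which sum to 3,030.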

The average wavelet feature values are coded in gray
level in Figure 3. The gray level coding of the average
feature values shows that the wavelets respond to
the significant visual characteristics of cars: the vertical 
wavelets respond to the sides of the car, the horizontal 
wavelets respond to the roof, underside, top of the grille 
and bumper area, and the diagonal wavelets respond to 
the corners of the car's body. At the scale 16 x 16, we 
can even see evidence of what seems to be license plate 
and headlight structures in the average responses. 

Once we have computed the feature vectors for our 
positive and negative patterns, we use these to train 
a support vector machine (SVM) classifier. SVMs are 
a principled technique to train classifiers that is well- 
founded in statistical learning theory; for details, see 
[21] [5]. Traditional training algorithms such as back
propagation minimize only the training set error; one
of the main attractions of SVMs is that they simultaneously
minimize a bound on the empirical error and the complexity
of the classifier. In this way, they
are capable of learning in high dimensional spaces with 
relatively few training examples. 
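
For reference, the standard soft-margin SVM formulation makes this tradeoff explicit. The notation below (weight vector w, bias b, slack variables \xi_i, penalty C, labels y_i in {-1, +1}, and \ell training examples) is the usual textbook notation, not taken from this paper, and the paper does not state which kernel or value of C was used.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{\ell} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w \cdot x_i + b \right) \ge 1 - \xi_i, \quad
\xi_i \ge 0, \quad i = 1, \dots, \ell
```

The first term controls the complexity of the classifier through the margin, and the second term penalizes training errors. In the setting described here, each x_i would be a 3,030 dimensional wavelet feature vector.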

This controlling of both the training set error and the 
classifier's complexity has allowed support vector ma- 
chines to be successfully applied to very high dimensional 
learning tasks; [6] presents results on SVMs applied to a
10,000 dimensional text categorization problem.

3 Experimental Results 

To detect cars in out-of-sample images, we shift the 
128 x 128 window over all locations in the image, com- 
pute the wavelet representation for each pattern, and 
feed it into the SVM classifier to tell us whether or not 
it is a car. To achieve multiscale detection, we itera- 
tively resize the entire image and at each step run the 
fixed size window over the resized images. The shifting 
and wavelet computation can be done more efficiently by
computing the wavelet representation for an entire image 
once and then shifting in wavelet space. Figure 4 shows 
some examples of our system running over out-of-sample 
images gathered from the internet. 
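
The scanning procedure can be sketched as follows. The set of resize factors, the step size, and the `is_car` callable are all assumptions made for illustration; the paper does not specify the resize schedule, and `is_car` stands in for the wavelet transform followed by the SVM.

```python
def window_positions(width, height, window=128, step=1):
    """Top-left corners of every fully contained window in an image."""
    return [(x, y)
            for y in range(0, height - window + 1, step)
            for x in range(0, width - window + 1, step)]


def multiscale_windows(width, height, scales, window=128, step=1):
    """Yield (scale, box) pairs, with box in original image coordinates."""
    for s in scales:
        w, h = int(round(width * s)), int(round(height * s))
        for x, y in window_positions(w, h, window, step):
            # Map the fixed-size window back onto the original image.
            yield s, (x / s, y / s, window / s)


def detect(width, height, scales, is_car, window=128, step=1):
    """Return the boxes that the (hypothetical) classifier accepts."""
    return [box
            for s, box in multiscale_windows(width, height, scales, window, step)
            if is_car(s, box)]
```

For example, `detect(256, 256, [1.0, 0.5], is_car=lambda s, box: s == 0.5, step=64)` returns the single box covering the whole image, since at half size only one 128 x 128 window fits.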

To obtain a proper characterization of the perfor- 
mance of our system, we present the ROC curve which 
quantifies the tradeoff between detection accuracy and rate of
false positives in Figure 5. The false positive rate is 
measured as number of false positives per window pro- 
cessed. For an average sized image in our test set (around 
240 x 360) there are approximately 100,000 patterns the 
system processes. For a 90% detection rate, we would 
have to tolerate 1 false positive for every 10,000 patterns, 
or 10 false positives per image. There has been very little 
formal characterization of the performance of car detection
systems in the literature, making it difficult to compare 
our approach to others'. Given the performance we have 
achieved, however, we believe our system will compare 
favorably to existing car detection systems. 
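
The arithmetic above, and one common way of tracing such a curve, can be sketched as follows. Sweeping a threshold over classifier scores is the usual procedure and is assumed here; the paper does not describe how its curve was generated, and the function names and toy scores are hypothetical.

```python
def roc_points(car_scores, non_car_scores):
    """Sweep the decision threshold over all observed scores.

    Returns (threshold, detection_rate, false_positives_per_window) tuples,
    ordered from the strictest threshold to the most permissive.
    """
    thresholds = sorted(set(car_scores) | set(non_car_scores), reverse=True)
    points = []
    for t in thresholds:
        detected = sum(s >= t for s in car_scores)
        false_pos = sum(s >= t for s in non_car_scores)
        points.append((t,
                       detected / len(car_scores),
                       false_pos / len(non_car_scores)))
    return points


def expected_false_positives(windows_per_image, windows_per_false_positive):
    """Expected false positives in one image at a given operating point."""
    return windows_per_image / windows_per_false_positive


# Operating point quoted in the text: 1 false positive per 10,000 windows,
# with roughly 100,000 windows in an average test image.
per_image = expected_false_positives(100_000, 10_000)

# Toy scores, for illustration only.
points = roc_points([0.9, 0.4], [0.5, 0.1])
```

With the figures quoted in the text, `per_image` evaluates to 10 false positives per image.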

4 Future Work and Conclusion 

This paper has presented a trainable framework for ob- 
ject detection as applied to the domain of car detection. 
While there has been considerable work in face and peo- 
ple detection, much of which uses motion, there has been 
relatively little work in car detection in static images. 
The framework we describe is indeed general and has 
successfully been applied to face, people, and now
car detection. This success can be attributed to our use
of an effective representation that smooths away noise
while capturing the important aspects of our object
class, and to the use of the support vector machine classifier
that allows the system to learn in a 3,030 dimensional 
space with only 6,198 examples (1,032 positive and 5,166 
negative). 




Figure 4: Results of car detection on out-of-sample images. A is from www.lewistonpd.com; B, C, D, E, F, G, H, 
J, K, L, M, O are from www.corbis.com; I is from www.enn.com; N is from www.foxglove.com. Missed positive 
examples are due to occlusions (A, F, O) or where a car is too close to the edge of the image (A). False positives (C, 
J, I, N) are due to insufficient training and can be eliminated with more negative training patterns. 
Figure 5: ROC curve for car detection using wavelet
features over color images. 



Extending this to detect arbitrary poses of cars may 
be difficult as side view poses have much higher variability
than frontal and rear views. One possible solution to
this is to use a component-based approach that iden- 
tifies wheels, windows, and other identifiable parts in 
the proper geometric configuration; a version of this ap- 
proach is applied to people detection in [10]. 